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Abstract 



Selective sweeps are typically associated with a local reduction of genetic diversity around 
the adaptive site. However, selective sweeps can also quickly carry neutral mutations to ob- 
servable population frequencies if they arise early in a sweep and hitchhike with the adaptive 
allele. We show that the interplay between mutation and exponential amplification through 
hitchhiking results in a characteristic frequency spectrum of the resulting novel haplotype 
variation that depends only on the ratio of the mutation rate and the selection coefficient of 
the sweep. Based on this result, we develop an estimator for the selection coefficient driving 
a sweep. Since this estimator utilizes the novel variation arising from mutations during a 
sweep, it does not rely on preexisting variation and can also be applied to loci that lack 
recombination. Compared with standard approaches that infer selection coefficients from 
the size of dips in genetic diversity around the adaptive site, our estimator requires much 
shorter sequences but sampled at high population depth in order to capture low-frequency 
variants; given such data, it consistently outperforms standard approaches. We investigate 
analytically and numerically how the accuracy of our estimator is affected by the decay of 
the sweep pattern over time as a consequence of random genetic drift and discuss potential 
effects of recombination, soft sweeps, and demography. As an example for its use, we apply 
our estimator to deep sequencing data from HIV populations. 
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Introduction 



The frequency and strength of positive selection are key parameters of the evolutionary 



process yet reliable estimates are often very difficult to obtain (Nielsen et al. , 2005 Eyre 



Walker, 2006). As it becomes increasingly practicable to analyze the genetic diversity of 



natural and experimental populations at high depth, we can hope to obtain better estimates 
from a detailed analysis of the signatures positive selection is expected to leave in such data. 

The standard model describing the population genetic signatures of positive selection 
is the so-called selective sweep (Maynard Smith and Haigh, 1974). Selective sweeps 



produce distinct effects on the patterns of genetic diversity in the vicinity of the adaptive 
site: (i) Hitchhiking of linked neutral polymorphism with the adaptive allele generates dips 



in diversity that level off with increasing genetic distance from the selected site (Maynard 
Smith and HaighI 119741). (ii 



Because different lineages from a population sample taken 
after the sweep can coalesce almost instantaneously during the early phase of the sweep, there 



should be an excess of singletons in the neutral diversity around the selected site (Slatkin 



and Hudson 


1991 


Barton 


1998 


Durrett and Schweinsberg 


2004 


alleles can be brought to high frequencies ( 


Fay and Wu 


2000 


), which is 



fin) Derived 



random genetic drift alone, (iv) The adaptive haplotype should be longer than expected under 
neutrality since recombination has had only a short time to degrade it (Hudson et al. , 1994). 
These hallmark signatures underlie the majority of approaches to scan for recent adaptation 
in population genetic data 



Hudson et al. 



1987 



Tajima 



1989 



Wiehe and Stephan 



1993 


Fay and Wu 


2000[ 


Sabeti et al] 


2002; |Kim and Stephan, 2002; Przeworski 


2003 


Nielsen et al. 


2005 


Voight et al. 


2006; 


Sabeti et al. 


2007; 


Andolfatto 


. 2007) 



In addition to detecting selective sweeps, one would often like to know the strength of 



selection that drove the sweep (Macpherson et al. , 2007 Hernandez et al. , 2011 Sattath 



et al, 2011). One common approach is to infer selection coefficients s of the adaptive allele 



from the size of the dip of approximate length s/r around the adaptive site, where r is the rate 



of recombination per length in the corresponding region (Maynard Smith and Haigh, 1974 



Kaplan et al, 1989 Kim and Stephan, 2002). Approaches based on this signature rely on 



the interplay between recombination and ancestral variation and are thus not applicable for 
scenarios where recombination or ancestral variation is poorly characterized. Furthermore, 
certain scenarios of adaptation do not always generate clear dips in diversity. Examples 
are incomplete sweeps where the adaptive mutation is not fixed in the population, and soft 
sweeps, where more than one haplotype has swept through the population (Pritchard 



et al, 2010) 



Here, we develop an estimator for selection coefficients at candidate loci where a selective 
sweep has occurred recently. Our estimator is based on the analysis of the novel haplotype 
variation that arises from neutral mutations on the sweeping adaptive haplotype. These early 
variants are very different in kind from the variation due to neutral mutations occurring after 
a sweep, since they have experienced an initial phase of exponential amplification. Mutations 
after the sweep will quickly outnumber the few that happened early during the sweep, but 
they will drift to higher population frequencies comparatively slowly. 

The frequency spectrum of the early haplotype variants is determined by the distribution 
of their seeding times during the sweep, as illustrated in Figure [T] We show analytically that 
their rank-frequency spectrum (the frequencies ordered by decreasing abundance) decays 
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according to a simple power-law, which differs markedly from the approximately exponential 
decay expected under neutral evolution. This power-law is characterized by only a single 
parameter: the ratio of the rate at which new haplotypes arise and the selection coefficient 
s of the sweep. 

We use this result to develop an estimator of s that compares the strength of selection 
to the rate at which novel haplotypes are produced. Novel haplotypes can be produced by 
mutation or recombination. In many organisms, recombination is rare and the mutation 



rate is larger or similar to the recombination rate (Roach et al, 2010 Ossowski et al. 



2010 Haag-Liautard et al., 2007). Hence our estimate is much less sensitive to poorly 



characterized recombination rates, thereby overcoming several problems of estimators based 
on dips in diversity. In particular, our estimator should also be applicable to organisms that 
exchange genomic segments via horizontal transfer (many bacteria), and populations where 
levels of ancestral variation are very low, for instance in experimental evolution with clonal 
populations. 

Our estimator relies on deep diversity data for accurate measurements of the population 
frequencies of rare haplotypes. With large-scale sequencing projects such as the 1000 genomes 
project in humans (|1000 Genqmes Project Consortium et al.\\2010\j and similar projects 



in flies (Drosophila Population Genomics Project 2011) and plants (Cao et al. 



2011) currently under way, such data will become available soon. 




time [generations] time [generations] 



Figure 1: (A) Generation of novel hap lo type diversity from secondary mutations during 
a selective sweep. At t = a beneficial mutation establishes and the frequency, x , of its 
underlying haplotype (blue) rises. After some time, a neutral mutation occurs in an individual 
carrying the beneficial mutation, creating a new adaptive haplotype variant (red) which itself 
increases in frequency, x\. This happens repeatedly, giving rise to a range of low- frequency 
haplotypes that all descended from the initial founding haplotype. Sweep parameters are 
s = 0.03 and u = 0.004, seeding times of new variants are given by Equation Q. (B) The 
distribution of seeding times for different i according to Equation pfc. 
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Materials and Methods 



Forward simulations of selective sweeps 

Simulations were performed with custom written C++ programs that represent each haplo- 
type in the population by a bit string of fixed length L = 1024 or L = 4096. Populations are 
initialized with a sample of size 10 4 individuals obtained from the neutral coalescent using 



the program ms (Hudson, 2002). Note that this initial variation is only needed to allow 
diversification by recombination. We constrained ms to return a sample with 1023 (or 4095, 
respectively) segregating sites, leaving the site in the center of the locus for the beneficial 
mutation. Each genotype sampled from the coalescent is copied N /10 A times into the initial 
population. The beneficial mutation is introduced into one randomly chosen individual in 
generation 0. If the beneficial mutation is lost due to random drift, the population is reset to 
the neutral sample and the beneficial mutation is introduced again until a successful sweep is 
observed. Note that our initial condition with each site polymorphic is not supposed to imply 
that the product of the population size and the per site mutation is much larger than one. 
Instead, those loci correspond to the subset of polymorphic loci scattered along a genomic 
region. The results do not depend on the number of segregating sites in the initial sample, as 
long as the pairwise difference between two randomly chosen genomes is almost always much 
larger than one. We repeated the simulations using only a tenth of the segregating sites and 
obtained similar results (see Supplement). 

The Malthusian fitness of a haplotype is given by F = s or 0, depending on whether or 
not the haplotype carries the beneficial allele at position L/2. In each generation, a pool of 
gametes is produced to which each individual contributes a Poisson distributed number of 
copies with mean exp(F — (F) + C), where C = (1 — N/N ) serves as a carrying capacity 
to keep the population size approximately constant at Nq and (F) is the mean fitness in the 
population. A fraction r of the newly produced gametes are paired at random and crossed 
over at a randomly chosen position (since r C 1, there is no crossover in most gametes). 
At the L — 1 neutral loci, mutations are introduced at random positions into the gametes 
with the specified mutation rate u. When simulating the sweeps of different strength, the 
mutation rate is changed accordingly to keep the ratio of the strength of selection and the 
total mutation rate in the range specified in the figure. This is akin to changing the size of 
the locus simulated, and allows more efficient simulations. The code is available on request. 



To compare our method to previously developed methods to estimate sweeps (Nielsen 



et al., 2005), we generated data using the program msms by Ewing and Hermisson 



sec 



supplementary material. 



Analysis of HIV data 

Sequences from the relevant samples were read by a python script and translated. Sequences 
corresponding to identical amino acid sequences were grouped together. Within each of 
these groups, the number of transitions and transversion that do not change the amino-acid 
sequence were determined. In case the sequence overlapped with the end of the gene and 
extended into the long terminal repeat (LTR), only the aminoacid sequence up to the stop 
codon was used to group sequences and mutations in the LTR treated as neutral mutations. 
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Sequences from the study on early immune system escape mutations by Fischer et al. 



(2010) are available from the NCBI short read archive under accession number SRA020793. 
Sequences from the study on drug resistance evolution by Hedskog et al. (2010) were ob- 
tained from the authors. 



Results 

Haplotype frequency spectrum of a selective sweep 

Consider a new adaptive mutation with selection coefficient s > in a panmictic haploid 
population of constant size N. We assume that selection is strong (Ns ^> 1). Once an 
adaptive mutation becomes established in the population by reaching a population frequency 



x ~ (2Ns) 1 that assures its escape from future stochastic loss (Maynard Smith, 1971) 
its frequency trajectory can be modeled by logistic growth 



Let us first assume that recombination is infrequent compared to mutation (we will discuss 
recombination below). If neutral mutations occur during the early phase of the sweep on the 
adaptive haplotype, they can generate new variants of the adaptive haplotype that can also 
rise in frequency (Figure [l]^-). We adopt an infinite haplotypes model where every such 
mutation creates a distinct haplotype. When the frequency x(t) of the beneficial allele is still 
small, novel adaptive haplotypes become established in the population at an approximate 
rate a(t) = 2s x uNx(t), where u is the rate at which neutral mutations occur on the sweeping 
haplotype per generation. The factor 2s accounts for the establishment probability of those 
new haplotypes. After the sweep is completed, a novel haplotype variant will be the more 
frequent the earlier it arose in the sweep. Thus, to understand the spectrum of haplotype 
frequencies, we have to study the distribution of times at which these haplotypes are seeded. 

Since novel haplotypes are seeded in independent events, the total number of established 
new haplotypes up to time t is Poisson distributed with parameter 

X(t) = tatt'W « -e st . (2) 
Jo s 

The approximation assumes that the relevant haplotypes are seeded while the sweeping allele 
is still rare and increases exponentially, which is appropriate as long as t <C \og(Ns)/s, and 
that s»u, which can be ensured by reducing the size of the locus under investigation. 

The probability density that haplotype i is seeded at time t is given by the rate a(t) of 
establishing new haplotypes multiplied by the probability of having i — 1 haplotypes at time t 

Mt) = a(t f~ X{t)X{ ^ 1 (3) 

where we have used the same approximations as in Equation ([2]) and dropped factors inde- 
pendent of t that ensure normalization. This distribution Pi(t) is shown in Figure [lb for 
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different i. The intervals between the peaks of Pi(t) and pi + i(t) become smaller with increas- 
ing i, while the width of each peak decreases. More precisely, we derive from Equation (J3]) 
that the most likely seeding time of the ith haplotype is given by 

t; = -jog(-), (4) 



s \ u 



and that the peak of Pi{t) has a width of <7j ~ (sy/i)' 1 (if i ^> 1). The joint distribution 
of seeding times as well as the unordered frequency spectrum are derived in Supplementary 
Information, section 2. 

Assuming that the ith haplotype establishes at t* and has same fitness as the founding 
haplotype, it will also rise in frequency, albeit with a time-lag t*. Together with Equation Q 
we obtain an approximation for the expected frequency of the ith haplotype 

,(s-u)(t-t*) /7/\l-- 



xAt) = — t —^e- ut - . (5) 

v ' e st + 2Ns VisJ v ; 

The limit corresponds to the asymptotic behavior as the beneficial allele approaches fixation. 
Note that this expression holds for haplotypes i > 1. The zero-th haplotype, i.e., the haplo- 
type the mutation inititally arose on, grows as x$ (t) = (e {s - u)t )/(e st + 2Ns) and asymptotes 
to e~ ut . Each of the haplotype variants carrying the adaptive allele thus grows at a slightly 
smaller rate, s — u, than the overall rate s at which the adaptive allele rises, accounting for 
the founding of new haplotype variants by mutation. 

The interplay between the exponential amplification of the adaptive allele and the gener- 
ation of new adaptive haplotype-variants thus gives rise to a simple power-law decay of the 
rank-frequency spectrum {xo,xx, • • • } of adaptive haplotypes: the most abundant adaptive 
haplotype, on average, is typically s/u times more frequent than the 2nd most abundant 
adaptive haplotype, 2s /u times more frequent than the 3rd most abundant adaptive haplo- 
type, and so forth. This power-law spectrum with exponent /3 = 1—u/s differs markedly from 
that of neutral evolution, where haplotype frequencies are expected to decay as Xi/x ~ e~ 1 / 
with haplotype rank % (0 = 2Nu; see Supplementary Information, section 1). 

The distribution of haplotype frequencies after a sweep is related to the distribution 



of "family sizes" studied by Barton (1998). Barton presented numeric results for the 



distribution of the size of groups of individuals that share a common ancestor at the beginning 
of the sweep. Haplotype frequencies, however, refer to the size of groups identical by state, 
rather than descent, and characterize diversification that happened after the sweep, rather 
than the ancestral variation that survived the sweep. 

In order to compare the haplotype counts no > n\ > n 2 . . . in a sample of size n to Equa- 
tion ([5]), we need an estimate of e~ ut . The simplest estimate of e~ ut is the sample frequency 
of the most abundant haplotype, n /n, whose deterministic value would be precisely e~ ut . 
However, no can be quite variable because the variances of the first few seeding times are still 
large. To circumvent this problem, one can construct a more stable estimate of e~ ut based 
on the first k haplotypes and compare the sum of the sample frequencies to the sum of the 
deterministic expectation of the frequencies: 



no - ~ — k — 77Vi9 ' W 
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where /3 = 1 — u/s. This effective no is insensitive to k as long as the first few haplotypes 
are included. Figure [2]\ shows a comparison of the simulated haplotype spectra with 

n - » (-Y ■ (7) 
uq \is/ 

The logarithmic axes of the plot are chosen such that the algebraic decay shows up as a 
slope — p. The observed slope is correct initially, but the curves become slightly flatter after 
haplotype rank « 50. This is due to degradation of the sweep pattern by random genetic 
drift, which we will discuss next. 




Figure 2: (A) Ensembles of 5 haplotype rank-frequency spectra for u = 0.001 and two 
selection coefficients s = 0.1 and s = 0.01. Spectra were estimated from samples of size 
n = 10 4 taken when the allele frequency of the beneficial mutation reached 0.99. The 
deterministic expectations given in Equation ^ are shown as dashed lines. The normalizer 
h was calculated according to Equation (|6| with k = 4. (B) Relaxation of haplotype spectra 
to the neutral spectrum. Rank-frequency spectra were estimated from samples of size n = 10 4 
when the beneficial allele reached population frequencies 0.9, 0.99, 1.0, and for several time 
points after completion of the sweep (time in generations) (s = 0.001, u = 10~ 5 , L = 1024, 
iV = 10 6 , = 2Nu = 20). Each curve is the median of 50 simulated sweeps. The most 
common haplotypes (rank 0, not represented on log scale) are indicated by circles. Right 
after the sweep, the population consists of one very common haplotype, x$ ~ 0.98, and a 
large number of rare variants whose frequency decays as predicted by Equation Q (lower 
dashed line). Over time of order t rs N, the frequency spectrum relaxes to the expected 
neutral spectrum (upper dashed line). 



In deriving Equation ([7]), we have neglected fluctuations in the seeding and establish- 
ment times of novel haplotype variants as well as genetic drift after the sweep is complete. 
Relaxing these assumptions requires a stochastic calculation, which is given in the Supple- 
mentary Information, sections 3A and B. Specifically, we calculate the probability of finding 
i c haplotypes to be present in more than n c copies in the population, assuming the logistic 
trajectory of the beneficial allele given in Equation 0. These calculations are lengthy, but 
the resulting effects can be understood by the simple arguments given below. 
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Variations in the establishment times will cause the haplotype spectra to fluctuate and 
result in spectra below or above Equation ^ if new haplotypes were seeded fortuitously early 
or late. As expected from the distribution of seeding times in Figure [TJ the frequencies of the 
first few haplotypes fluctuate substantially, while the later ones do not. Since the number of 
haplotypes seeded by time t is Poisson distributed, so is the number of haplotypes above a 
certain frequency after the sweep is complete (neglecting genetic drift, there is a one to one 
mapping between establishment time and eventual frequency). Hence, the fluctuations due 
to random seeding times are small if the expected number of haplotypes above the chosen 
frequency is large. In accordance with these arguments, variation of haplotype spectra in 
Figure [2]A. decreases with increasing i. 

Once the beneficial allele approaches fixation, the strength of selection goes to zero, and 
exponential amplification ceases. The frequencies of the rare haplotypes thereafter change 
primarily due to random drift, while the frequencies of the common haplotypes decrease due 
to additional mutations that produce new rare haplotypes. Both of these processes lead to a 
homogenization of haplotype frequencies, i.e., common haplotypes becoming rarer and rare 
haplotypes becoming more common, which ultimately wipes out the sweep signature. This 
relaxation to the neutral haplotype spectrum is shown in Figure |2]B. 

Since the time required to drift to frequency x c is approximately Nx c , the haplotype 
rank-frequency spectrum will soon develop a bulge of drift dominated haplotypes at low 
frequency and high rank. This accumulation of rare haplotypes due to genetic drift can 
be calculated perturbatively (see Supplementary Information, section 3). For the expected 
number of haplotypes i c with frequencies above x c , one obtains 



,. , u 



e -ut 



(x c - At/N) 



l+u/s 



where At is the age of haplotype i c which is of the same order as the age t of the sweep. 
Note that Equation ^ is essentially Equation ^ solved for i with the "drift contribution" 
At/N subtracted from the frequency x c . Since the age At of the haplotype is similar to that 
of the beneficial allele, t, we conclude that for frequencies less than t/N, the sweep spectrum 
is eroded, while for frequencies much larger that t/N, it prevails. After a time of order N, 
the entire spectrum has relaxed to the neutral haplotype frequency spectrum as shown in 
Figure [2]B. Note that the time it takes for the beneficial allele to go from frequency 0.9 to 
complete fixation can be long, and substantial erosion of the sweep spectrum can happen 
during this time interval. 

In general, the time that has elapsed since the beginning of the sweep is unknown. To 
obtain an estimate of the age of the sweep, it is useful to reconsider Equation d5|), which 
states that the frequency of the founding haplotype is expected to asymptote to e~ ut . This 
behavior, and in particular the more accurate estimate of e~ ut given in Equation ([6]), allows 
one to obtain a rough estimate of the sweep's age. 

In addition to genetic drift, limited sampling of the population results in a deviation of the 
observed from the expected sweep spectra. The detected haplotype counts are a convolution 
of the their true frequencies with the distribution expected from sampling the population. 
After ranking of haplotype counts, a large number of rare variants leads to a flattening of 
the spectrum, and in particular an excess of singletons. 
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Estimating selection strength 

According to Equation Q , the expected rank-frequency spectrum of haplotypes in a selective 
sweep is determined by the single parameter u/s - the ratio of the mutation rate over the 
locus and the strength of selection. Given an independent estimate of u, one can therefore 
use the haplotype frequency spectrum to estimate the selection coefficient of a sweep, for 
example by simply counting the number of different haplotypes present above a specified 
frequency cutoff in the sample. Let i c be the overall number of different haplotypes which 
are present in at least n c copies each. The estimator for the strength of selection is then 



where hq is either set to the observed no or determined via Equation ([6]). 

The estimator s allows us to either fix n c and then determine the count i c from the 
data, or fix i c and then measure n c . In either case, the cut-off should be chosen to achieve 
maximum accuracy. Since the number i c of haplotypes above a frequency threshold is Poisson 
distributed, fluctuations of s decrease with smaller n c and larger i c . To low i c , however, will 
include parts of the frequency spectrum that has already been degraded by drift, which 
predominantly affects the rare haplotypes. In addition, limited sampling depth limits i c . 
Hence i c should be chosen as large as possible, but such that n c ^> 1 and the frequency 
spectrum is still of power-law form down to haplotype i c . In this case, the relative error of s 
is of order 1/ ' \[% c (assuming the model assumptions are met). 

It is generally advisable to fix i c and determine n c = rii c because spectra come flatter than 
i^ 1 due to the confounding effect of genetic drift and limited sampling of the population. 
Figure [3] shows the performance of our estimator when applied to simulated sweeps (see 
Methods) for different selection coefficients and mutation rates as well as its dependence on 
the choice of i c . The simulations confirm that our estimator performs accurately over the 
range of moderate selection (s ~ 10~ 3 ) to strong selection (s ~ 0.2) given the parameters 
used. However, there is a systematic downward bias for small s. This bias is due to genetic 
drift and limited sampling depth. 

In the Supplementary Information, section 3, we derive an estimator that accounts for 
genetic drift perturbatively, see Equation (24) of the supplementary information. In essence, 
this estimator contains the same correction present in Equation (J8|, i.e., it subtracts the drift 
contribution t/N from the frequency of the rare haplotypes. Importantly, this drift correction 
sets the limits of applicability of s: (i) After the completion of the sweep, we can estimate 
s only for a limited time. Specifically, for a given sample frequency n c /n, we require that 
t < Nn c /n. (ii) The range of s that can be reliably estimated by the method is limited: even 
if we catch the sweep right at the time of fixation of the adaptive allele, it still needs to have 
occurred faster than the time scale of its own degradation by drift. Given that the time it 
takes to complete a sweep is approximately (2/s) log Ns and that n c /n w uj (si c ), we require 
N > (2i c /u) log Ns. Together with s ^> u, we find that the range of s that can be estimated 
is bounded from below by Ns ^> i c logNs. This estimate is consistent with the deviations 
in Figure |3] for s < 10~ 3 . 




(9) 
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Figure 3: (A) Mean s over 50 simulation runs for a wide range of selection strengths and 
three different ratios of u/s using Equation ^ with i c = 5. Error bars are ± one standard 
deviation. The dashed black line indicates the correct s. Population samples of size n = 10 3 
were taken when the beneficial allele reached fixation. For small s, there is a slight downward 
bias due to genetic drift. L = 4096 and N = 10 6 . (B) Distribution of s at different selection 
strengths for fixed u/s = 0.05. (C) Performance of s as a function of i c . There is a tendency 
to underestimate the strength of the sweep if i c is large, which arises from the inclusion of 
rare haplotypes that are affected by genetic drift and may contain haplotypes that arose 

1/2 

after the sweep. (D) Standard deviation over 50 estimates as a function of i c ■ Variation 
is suppressed at large i c due the discreteness of low population frequencies. 

Recombination 

Up to now, we have neglected any potential effects of recombination on the frequency dis- 
tribution of the new adaptive haplotype variants arising in a selective sweep. In fact, it is 
one of the advantages of our estimator that it does not exclusively rely on recombination 
for its inference of selection coefficients and can thus also be applied to systems which lack 
recombination or where recombination rates are not well known. However, if recombination 
occurs on the sweeping haplotype, this can generate new variants of the adaptive haplotype 
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in a manner similar to mutation. This becomes important whenever the recombination rate 
r over the locus is comparable to its mutation rate u. In this case, the rate at which new hap- 
lotype variants are generated during the sweep will actually be higher than assumed based 
on the mutation rate u, and we thus expect s to underestimate the strength of the sweep. 

We propose two ways to incorporate recombination: First, one can treat recombination 
analogously to mutation and assume that every recombination event generates a new variant 
of the adaptive haplotype that will be different from all other existing variants. Under this 
assumption, the mutation rate u simply needs to be replaced by the sum u + r. Figure [4]A_ 
shows how our estimator performs for different ratios of mutation to recombination rate 
under this approach. The infinite haplotypes assumption is appropriate if ancestral diversity 
is high and independent recombination events are thus unlikely to yield equal outcomes. If, 
however, ancestral diversity is low, the infinite haplotypes model will overestimate the rate 
at which new haplotypes arise. 




io-' io -2 io _1 i(r 3 iir 2 id 

actual selection coefficient actual selection coefficient 



Figure 4: Estimating selection coefficients in the presence of recombination. (A) Performance 
of our estimator s using Equation (|9j) with diversification rate u + r for different r with 
u/s = 0.05. The slight overestimation at high recombination rates is due to fact that high 
and low frequency haplotypes degrade at slightly different rates due to the possibility of 
recombination with an identical haplotype. (B) Estimation of s from haplotype spectra 
restricted to those haplotypes that differ at only one site from the most abundant haplotype. 
After pruning, there are fewer abundant haplotypes and the estimates are more sensitive to 
genetic drift. 

Alternatively, we can try to effectively exclude most recombined adaptive haplotype vari- 
ants. Recombined variants are likely to differ from the founding haplotype at more than one 
site, whereas variants originating from mutation should differ at only one site. By filtering 
out the former, we can treat the remainder as originating primarily from mutation events. 
As shown in the Supplementary Information, section 4, such a set of pruned "mutation-only" 
haplotypes has a slightly different rank-frequency spectrum: The exponent /3 is exactly 1, 
rather than 1 — u/s. Besides that, Equation ^ can be applied with u being the mutation 
rate and n c being the abundance of the i c haplotype in the pruned set of haplotypes. The 
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estimates obtained by this procedure are shown in Figure |4|3. While this modified estima- 
tor still seems to underestimate selection coefficients slightly, it performs consistently across 
different recombination rates. 

The difference in the exponent of the rank frequency spectrum arises from slightly dif- 
ferent rates of seeding and amplification of new haplotypes. Without pruning, the rate of 
establishment of novel haplotypes is proportional to the frequency of the beneficial allele, 
which follows a logistic growth with rate s. Haplotype frequencies, however, only grow with 
rate s = s — u — r. This slight difference in rates is responsible for the deviation of the expo- 
nent /3 from 1 by u/s or (u + r)/s, where the latter also accounts for recombination. When 
restricting the analysis to haplotypes that descend directly from the founding haplotype, the 
rate of establishment of novel haplotypes is proportional to the frequency of the founding 
haplotype. Hence, establishment and amplification happen with the same rate s = s — u — r 
and the exponent is exactly 1. 

We note another minor difference in the dynamics of haplotype frequencies in the presence 
of recombination. When the beneficial allele reaches fixation, the most abundant haplotype 
will most likely recombine with itself. Thus it diversifies more slowly after the sweep is 
complete. Minor haplotypes, however, continue to diversify through recombination, which has 
the effect of slowly decreasing their frequency and producing novel low frequency haplotypes. 
This slow decrease opposes the effects of genetic drift, which has the tendency to increase the 
frequency of minor alleles. In simulations, estimates of s often become more accurate due to 
this effect. 



Application of s to deep sequencing data from HIV 

Deep population diversity data of the kind required for the application of our estimator have 
recently been obtained for HIV populations by sequencing viral RNA from plasma samples 



(Tsibris et al, 2009; Fischer et al, 2010 Hedskog et al, 2010). We will present three 



examples from such studies to validate our method and discuss its applicability. 

As a first example, we investigated a sample taken shortly after infection when HIV 
evolution is primarily driven by mutations that allow the virus to escape the immune system. 



For each of the patients studied in Fischer et al. (2010), samples from several time points 
were investigated and amplicons sequenced using 454 technology. In one of the epitopes 
studied (gene nef in patient CH40), a mutation spread sufficiently slowly such that it was 
possible to observe the mutation rise from low frequency at day 16 to high frequency at day 
45 (Figure |5K). From this time series Fischer et al. (2010) estimate a selection coefficient 



of 0.3, assuming a generation time of HIV between 1.5 and 2 days (Perelson et al, 1996) 



To validate our estimator, we can compare this direct estimate of the selection coefficient 
obtained from time-course data with that obtained from the frequency spectrum of haplo- 
type diversity at the locus, which is shown in Figure [5|3. The figure contains the frequency 
of haplotypes with the dominant amino acid sequence which differ only by putatively neutral 
synonymous mutations. The haplotype rank-frequency spectrum according to Equation ^ 
suggests a ratio of u/s ~ 0.014. There is considerable uncertainty in the mutation rate esti- 
mates for HIV, ranging from 3.4 x 10~ 5 (IMansky and Temin[|1995|) to 9 x 10 -5 per site and 



generation (O'Neil et al. (2002) for mutations in the LTR). The majority of these mutations 
are transitions. In our example the sequenced locus tolerates a total of 93 transitions that do 



13 




Figure 5: The rise of a mutation selected by the immune system in the gene Nef of patient 
CH40 from Fischer et al. (2010). (A) Frequency trajectory of the adaptive allele with 
fitted curve for s ~ 0.3. (B) Observed haplotype spectrum of variation due to synonymous 
transitions from all sequences containing the selected allele at day 45. The straight line 
indicates the 1/i behavior. The pairwise nucleotide distance between all haplotypes ordered 
by abundance is shown in the inset. The error of the estimate are obtained assuming Poisson 
statistics with i c = 5. Note, however, that the uncertainty of the mutation rate is of the 
same order. 



not change the amino-acid sequence of Nef or fall into the LTR which overlaps the sequenced 
region. Using 4 x 10~ 5 per generation as an averaged estimate for the transition rate, we 
arrive at 

s « 93 Von " 5 gen_1 = °' 27 gen_1 ± °- 12 " (10) 
The uncertainty stemming from the random times at which the haplotypes are seeded is 
expected to result in relative errors ~ which for i c = 8 used here amounts to ±0.12. 

Additional uncertainty (probably larger) in the mutation rate needs to be added to these 
error bars. Given these uncertainties we consider our estimate in excellent agreement with 
the independent estimate from time-course data. Note that recombination is not expected to 
make a large contribution to haplotype diversification since the effective HIV recombination 
rate is less than half the mutation rate ( TNeher and Leitner] |2010[ |Batorsky et al] |2011[ ) . 
Furthermore, the HIV population has undergone several sweeps prior to the date this sample 
was taken such that the ancestral diversity was very low. 

We also checked whether the haplotype variation in our sample is consistent with the 
characteristic star-phylogeny expected for mutations that arose on the background of the 
sweeping founder haplotype. In particular, under the assumption of an infinite sites model 
where each such mutation gives rise to a new haplotype, we would expect that descendants 
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s = 0.037 ± 0.013 

• — • Sweep 2 

s = 0.051 ± 0.018 
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Figure 6: (A) Exemplary sweep in HIV evolution. The rank-frequency spectrum and pair- 
wise distance matrix are shown for the different haplotypes coding for the same amino-acid 



sequence aa 180-220 of the RT (Hedskog et al, 2010), comp. Figure 5 All but haplotype 13 



differ from the most abundant haplotype (0) by one mutation. Haplotype 13 is most likely a 
descendent of haplotype 1. (B) Example of a soft sweep in HIV evolution. In this sample, the 
two most abundant haplotypes (0) and (1) differ by more than one mutation (top left corner 
of left inset, showing the distance matrix of the 10 most abundant haplotypes), suggesting 
that the selected mutation arose independently on different backgrounds. By inspecting the 
first two columns of the left distance matrix, one finds that most of the other haplotypes 
differ by one mutation from either haplotype (0) or (1), consistent with the beneficial muta- 
tion having arisen independently on haplotypes (0) and (1). Assigning the rare haplotypes to 
either of the two abundant "founding" haplotypes and reordering yields the distance matrix 
shown as the right inset, this time including all 36 haplotypes. The reordered matrix displays 
two characteristic sweep patterns as blocks on the diagonal. Indeed, the haplotypes of each 
of these two blocks display the characteristic power-law rank-frequency spectrum. In both 
panels, i c = 8 and relative errors are 1 / a 



should always be one mutation away from the founding haplotype, while any two descendants 
should be two mutations away from each other. Our example shows precisely this pattern, 
as can be seen in the inset of Figure [5]A This is further evidence that recombination did not 
contribute substantially to haplotype diversification. 

As a second example, we consider deep sequencing data of the reverse transcriptase (RT) 
locus of HIV (120 bp, amino acids 180-220) QHedskog et al.\ |2010| ). The patients analyzed 
in this study were under anti-retroviral treatment, which implies strong selection pressure for 
resistance mutations at the RT locus. During the course of treatment resistance has evolved 
repeatedly in several patients. We therefore expect to see signatures of strong selective sweeps 
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in these viral populations. 

As before, we restricted our analysis to synonymous differences because our model assumes 
selectively neutral mutations. The 120 bp analysis windows typically contained enough vari- 
ation such that in most samples more than 8 different haplotypes were sampled. Hence 
we choose i c = 8 for estimating s and then measured the cutoff frequency n c of the least 
abundant haplotype to evaluate Equation (j9|. 

Figure [6}\ shows the measured haplotype rank-frequency spectrum for one of the patient 
samples (time point 4, patient 4). The spectrum decays with the characteristic power-law of 
a selective sweep. As discussed above, we determined the number of synonymous transitions 
in the founding sequence and used 4 x 10~ 5 as the rate of individual transitions. With 
this mutation rate estimate, we arrive at s ~ 0.07 per generation. Note that the causative 
mutation, or the combination of mutations, that drove this sweep does not necessarily have 
to lie inside the analyzed window but could also be located elsewhere in the genome. 

Figure [6j3 shows haplotype data from another population sample (time point 4, patient 
5). In this third example, the pairwise mutational distances between different haplotypes 
reveal a more complex pattern (left inset) that does not seem to be compatible with the 
simple star phylogeny observed in the example from Figure [6]A However, when we focus on 
the two most abundant haplotypes and 1, we see that these two haplotypes differ at more 
than one site from each other. The majority of the remaining haplotypes differs by only one 
mutation from either haplotype or 1. This pattern can be interpreted as the result of two 
independent partial sweeps where the two founding haplotypes (0 and 1) differ at two sites. 

We can disentangle these two hard sweeps by assigning the rare haplotypes to either of the 
two founding haplotypes if they from the latter by only one mutation. When reordering the 
distance matrix according to this assignment, we recover the two individual sweep pattern 
as blocks on the diagonal. This is shown in the right inset of Figure [6j3. Both reordered 
components now exhibit the characteristic power-law decay in their individual rank-frequency 
spectra with similar selection coefficients, again in the range of a few percent. 

A scenario as the one shown in Figure [6j3, with several haplotypes carrying the same 
adaptive mutation rising in frequency simultaneously, is commonly referred to as a "soft 
sweep" (Hermisson and Pennings, 2005). Soft sweeps can occur, for example, if an adap- 
tive mutation arises repeatedly on different haplotypes, which is expected to be common 
in viruses with large population sizes and high mutation rates such as HIV, where every 
single point mutation can arise many times each generation. Alternatively, soft sweeps can 
result when a sudden change in environment renders a previously neutral or deleterious allele 
selectively beneficial and adaptation involves several different adaptive haplotypes from the 
standing genetic variation. 

An unambiguous decomposition of the haplotype distance-matrix into individual sweep 
components of a soft sweep requires that the founding haplotypes differ by several mutations. 
This is likely to be the case for a diverse ancestral population. Note that the above approach 
also suggests how s can be applied to incomplete sweeps or sweeps restricted to a subpopu- 
lation. In those cases, the reordered distance matrix should display the characteristic sweep 
pattern only for a sub-block on the diagonal that is embedded into the diverse remainder of 
the population. 

Intriguingly, we see very little evidence of degradation of the haplotype spectrum by 
genetic drift. This might have several causes: (i) a very large population size, (ii) a very 
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recent sweep, or (iii) a scenario where sweeps are so frequent that they start overlapping such 
that exponential amplification of the most fit variants never ceases. The latter is expected in 
large continuously adapting populations (Neher and Shraiman, 2011). To investigate this 
matter further, one would need denser time-course data and information on genetic diversity 
across the genome. 



Discussion 



We have investigated the pattern of haplotype variation that arises from mutations occurring 
during the early phase of a selective sweep on the sweeping haplotype. We found that 
a selective sweep leaves a characteristic signature in the frequency spectrum of such novel 
haplotypes: when ordering all adaptive haplotype variants by their abundance, the frequency 
of the zth haplotype variant is approximately u/(is) times the frequency of the first, where 
u is the mutation rate of the locus and s is the selection coefficient of the sweep. The power- 
law decay of the rank-frequency spectrum is a consequence of exponential amplification (via 
selection) competing with diversification (through mutation and recombination), similar to 



the mechanism of preferential attachment originally proposed by |Yule (1925). We applied 
this finding to construct an estimator for the selection coefficient at a candidate locus where 
a selective sweep is suspected to have occurred recently. Such loci could be either identified 



by standard approaches to localize sweeps (Fay and Wu 



and Stephan, 2002 Nielsen et al, 2005 Voight et al 



2000 



2006 



Sabeti et al, 2002; Kim 



Sabeti et al, 2007), or a 



priori information could suggest that a sweep has occurred, such as in our HIV examples, 
where the failure of anti-retroviral therapy implied an adaptive change at the RT locus. 

The classic signature used to infer the strength of a recent sweep has been the size of the 
dip in diversity around the adaptive site (Maynard Smith and Haigh, 1974). Underlying 



this approach is the rationale that recombination breaks up linkage with increasing genetic 
distance from the adaptive site: The further away from the sweeping locus, the higher the 
probability that pre-sweep variation has become unlinked from the adaptive allele and thus 
remains polymorphic after the sweep. Hence, the width of the dip in genetic diversity scales 
with the ratio of the selection coefficient and the recombination rate. 



While successfully applied in many studies ( 


Kim and Stephan 


2002; 


Andolfatto 


2007 


Macpherson et al, 


2007 


Sabeti et al. 




2002, 


2007 


Sattath et al., 


2011 


), this 



approach suffers from a number of shortcomings that can limit its applicability. First, it 
relies on recombination and thus cannot be applied in organisms which frequently self, such 
as many plants or yeast, as well as organisms that recombine via horiontal gene transfer, 
such as bacteria. Recombination rates can also fluctuate strongly along the genome and 
in time (IWinckler et al] 120051 ICoop and PrzeworskiI 120071), rendering their precise 



estimation difficult, especially in the regions of reduced genetic diversity around a selective 
sweep. Furthermore, the approach relies on the presence of pre-existing variation at the 
sweep locus and assumes that we know how much variation was present originally. Ancestral 
diversity is literally absent in evolution experiments where populations start from a single 
clone, or in populations where adaptation occurs so frequently that genetic diversity is not 
fully restored between recurrent selective sweeps (Gillespie, 1994). Finally, for strong 



sweeps, the dips in diversity can potentially span very large regions extending up to entire 
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chromosomes (Andersen et al, 2012). In such cases, dip-sizes can no longer be assessed 
accurately. 

Our estimator is less dependent on ancestral diversity or accurate recombination rates 
estimates, since we investigate the new variation that emerges during the sweep and compare 
selection strength to rate of haplotype diversification (mutation and recombination) rather 
than to the recombination rate alone. It is essential for our estimator that one can accu- 
rately determine haplotype population frequencies on the order of u/(si c ), requiring deep 
population samples. For example, if we assume a sweep locus with u/s = 0.1 and base our 
analysis on the five most frequent adaptive haplotype variants (i c = 5), it would be nec- 
essary to measure haplotype frequencies of 2% accurately. This calls for a sample size of 
roughly 10 3 . Population genetic data of such depth is already available for HIV, where a 



large number of sequences can be obtained from plasma samples of infected patients (Tsib 



RIS et al., 2009 Fischer et al, 2010; Hedskog et al, 2010). Several efforts are currently 



being made to achieve a comparable in-depth characterization of the population diversity in 



several eukaryotes including humans ( 


1000 Genomes Project Consortium et al. 


flies ( 


Drosophila Population Genomics Project 


2011 


) , and plants ( 


Cao et al. 



2010) 



2011[ ). 

In practice, one typically has a data set (as in the HIV example a sample of size ~ 10 4 
sequences of length 120-200bp) and wants to estimate s. The size of the locus corresponds 
to a certain mutation rate, which will limit the range of selection coefficients that give rise 
to an observable sweep spectrum: If u/s is too small (say smaller than 10 times the inverse 
sample size), very little variation will be observed. On the other hand, if u/s is close to 
one, the founding haplotype diversifies as rapidly as it is amplified and will no longer be the 
dominant haplotype in the sample. Without a dominant haplotype, the method cannot be 
applied. Furthermore, our calculations always assume that u <C s. 

Too large values of u/s can be circumvented by restricting the analysis to only a fraction 
of the locus, which effectively reduces u. Hence, if a sample contains a large number of rare 
alleles and haplotypes but no dominant haplotype, one can reduce the window size to see 
whether a dominant haplotype with a trailing sweep spectrum emerges. Similarly, if one 
analyzes long genomes using a windowing approach and observes a region almost devoid of 
diversity (corresponding to u/s 1), one can increase u by increasing the window size 
until several low frequency haplotypes are observed. One can thus tune the sensitivity of the 
estimator to different ranges of s by adjusting u through the length of the locus. A selection 
coefficient of s — 10 -3 and a mutation rate of 10~ 8 per site, for example, would require a 
locus of length 10 kb to achieve u/s = 0.1. 



We used the popular program sweepf inder (Nielsen et al. , 2005) to compare the perfor- 
mance of our estimator to a traditional approach where the selection coefficient of a sweep is 
inferred from its signature in the surrounding ancestral diversity. As expected, our estimator 
consistently outperforms the traditional approach if ancestral diversity is low (Supplemen- 
tary Fig. 2). Interestingly, even when ancestral diversity is rather high (e.g. O = 0.01, 
comparable to the level of neutral diversity observed in D. melanogaster), the estimates from 
our method still have substantially lower variance than those obtained from sweepf inder. 
Note, however, that the comparison between the two methods is based on rather different 
data. While the sweepf inder requires long sequences (r/s m 1) at moderate coverage, our 
methods works with much shorter regions ((u + r)/s < 0.1) at very deep coverage. 

In order for our estimator to be applicable, haplotype variants that were produced after a 
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sweep and rose in frequency by random genetic drift need to be still at low-enough frequency 
such that they have not yet degraded the sweep signature. The extent of this degradation can 
be estimated from a simple argument: the lowest population frequency entering our estimator 
is of the order u/(si c ), while drift-dominated haplotypes will typically be at frequencies t/N, 
where t is the time that has elapsed since the sweep. We thus require u/(si c ) ^> t/N, which 
translates into the condition that population sizes have to be sufficiently large and that sweeps 
are not too old. As shown in Figure [2j drift results in a growing bulge at low frequencies 
where the rank spectrum is exponential rather than a power-law. Since the strength of 
genetic drift is very poorly known, we propose to inspect the ranked haplotype spectrum 
for deviations from the power-law and choose i c small enough such that drift dominated 
haplotypes are excluded from the analysis. In addition, one should check whether the sample 
is compatible with a near star phylogeny. Failure to exhibit these features should indicate 
either the absence of a selective sweep, or a sweep that has already been degraded. 

While we have focussed on mutation as a source of novel haplotypes, recombination can 
produce new haplotypes as well. Provided each recombination event results in a unique 
haplotype, all of the above formulae hold with u + r instead of u alone. If, however, recom- 
bination rates are not known, recombinant haplotypes can be filtered out by restricting the 
analysis to only those haplotypes that differ by a single mutation from the founding hap- 
lotype. This removes the majority of recombinant haplotypes since recombination with the 
diverse ancestral population typically incorporates several polymorphisms at once. 



Recent studies suggest that in many selective sweeps the adaptive allele has not actually 
become fixed in the population (incomplete sweep), or that several different haplotypes, 
all carrying the adaptive allele, have swept simultaneously through the population (soft 



sweep) (Hermisson and Pennings, 


2005 


Pritchard et al. 


2010 


Karasov et al. 


2010 


Burke et al. 


2010 


). We have demonstrated how incomplete sweeps and soft sweeps can be 



analyzed by our method (Figure j6p), making use of the fact that the novel haplotype variants 
our estimator is based on are related to the founding haplotype by a simple star phylogeny, 
which allows to easily differentiate them from other haplotypes that are not descendants of 
the founding haplotype. 



Our results on the haplotype pattern of a selective sweep were derived under the as- 
sumption of a panmictic population of constant size, raising the question how they might be 
affected by past demographic events such as population expansions or bottlenecks. Most cur- 
rent approaches to investigate selective sweeps rely on the specific patterns a sweep leaves in 
ancestral neutral diversity. These approaches can thus be very sensitive to past demographic 
events that shape the patterns of ancestral diversity over a time scale of neutral coalescence, 
which can include quite ancient demographic events. The approach presented here, however, 
is fundamentally different. In contrast to ancestral neutral variation, which is typically old, 
we focus on the very recent variation that arises during the early phase of a selective sweep. 
Demographic events that occurred prior to the onset of the sweep are thus irrelevant. Only 
very rapid changes in population size that happen while the adaptive allele is sweeping can 
cause significant deviations in the haplotype frequency spectra. 

The key scenario to be discussed in this context is that of a recent population expansion, 
since our estimator measures specifically the rate at which new haplotype variants were am- 
plified. Indeed, the haplotype frequency spectrum in an expanding population will resemble 
that of a selective sweep if the expansion lasted long enough for all coalescence to happen 
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during the expansion and if the expansion rate e is large enough such that the spectrum has 
not yet been eroded by drift, i.e., Ne ^> 1. In this case, our estimator turns into an estimator 
of the expansion rate that might be applicable to scenarios such as the expansion of an HIV 
population in a newly infected individual or the spread of novel strains of influenza. If, on top 
of an expansion, a beneficial mutation is spreading through the population, the haplotype 
carrying the beneficial mutation (and its descendants) is expanding faster than the ancestral 
haplotypes. The estimate of the selection coefficient will then in fact be an estimate of s + e. 

The spread of a beneficial mutation in the population generally reduces genetic diversity 
in the vicinity of the adaptive site. That a selective sweep can also amplify new diversity 
at very low population frequencies is thereby often overlooked. We have shown that the 
spectrum of this new variation records the exponential amplification of the novel beneficial 
allele in a clock-like fashion and can thus be used to estimate its selection coefficient. With 
the recent advances in sequencing technologies, the required information about low-frequency 
genetic variation is no longer elusive, making our estimator applicable for a wide range of 
analyses. 
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A Neutral haplotype spectrum 



The haplotype spectrum expected in a haploid neutral Fisher- Wright model without recom- 
bination can be calculated from the Ewens sampling formula (?). Ewens showed that the 
probability of a sample of size n is 



P„(a 1) a 2) ...,a n ) = — J] - -7, (H) 



@(n) AA 1 \ m / fl r 
v ' m=l x 7 

where aj is the number of allele classes that are sampled j times and 0(fc) = 0(0 + 1) • • • (0 + 
k — 1) with = 2iVw. The expectation of is therefore given by 
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where the last two approximate inequalities are accurate if k <C n and /c0 <C n, respec- 
tively. Hence the expected number i c of allele classes with more than n c members is roughly 
® 12k=n c ~ 0(l°g n — log0n c ), where cutting off the sum at k — n0~ x approximately 
accounts for the exponential. With this approximation, the i c th abundant allele class is 
expected to contain 

n 

n c « — exp(-i c /6) (13) 

copies of the allele. A more accurate expression of the spectrum is obtained by determining 
the n c such that i c = Ylk>n (°fc)> using the exact expression given above. This numerical 
solution for the haplotype spectrum is plotted in Figure 2B of the main text. 



B The distribution of haplotype frequencies 

In the main text, we calculated the distribution of the establishment time of the ith haplotype 
and the frequency of the corresponding haplotype. Here, we show how the joint distribution 
of all seeding times and the resulting frequency spectrum can be calculated assuming that 
the novel haplotypes are rare and evolve independently, which is justified if they constitute 
a small share of the total population, i.e., if u/s <C 1. In this case, the probability that k 
haplotypes i — 1, . . . , k are present in frequencies Xi is given by 

P(x u ...,x k \t)= [ T[dt i T\P(x i \t i ,t)P(t 1 ...,t k \t) , (14) 
Jo i i 

where P(xi\ti, t) is the probability that a haplotype has frequency Xi at time t given it became 
established at time U. The distribution of establishment times P{t\ . . . ,tk\t) is given by 

P(t 1 ...,t k \t) = ye-ti«°W 1 [[ a (t i ), (15) 
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where a(t') = 2suNx{t') is the rate of establishing novel adaptive haplotypes (main text 



Equation (1) and below). Note that the U defined in Equation (15) are not ordered. They 
are distributed according to a Poisson point process with density a(t'). Assuming that 
established novel haplotypes increase in frequency logistically according to Equation (5) of 
the main text, we have 

/ e O-«)(t-ti)\ 
P(*i\W) = S[*i- 2Na + e « ) > ( 16 ) 

where 5(x) is the Dirac 5-function (the stochastic analog is calculated below, see also (?)). 



Substituting P(xi\ti,t) into Equation (14) and integrating over ti, we obtain 



P(si,...,s*|f) = le-ZJ^COTT 1 a(t .) (17) 
k\ ±± (s — u)Xi 

i ' 

with ti = t-(s-u)^ 1 \og(xi(2Ns—e st )). Haplotypes that are common after the sweep are most 
likely seeded early during the sweep. Furthermore, we showed in Equation (5) of the main text 
that their relative frequencies stay approximately constant during the amplification phase. 
Hence we can determine the joint distribution of frequencies at early times t <C log2iVs 
while a(t) rs 2use st is still exponential. After substituting the ti and simplifying, we find 

P( Xl ,..., Xk \t)*—l[ I —j- r l[ X 7-^ (18) 

where we dropped factors independent of Xi which ensure normalization. A very similar result 
was found in (?). At large t, the form of the prefactor e~ changes due to the saturation of 
the allele frequency at 1, but the distribution of the frequencies of the haplotypes that were 
seeded early during the sweep remains of this form until the spectrum is eroded by genetic 
drift. 

The haplotype spectrum therefore decays with a power 2 + u/s, which is consistent with 
the power 1 — u/s obtained for the cumulative or rank spectrum (integrating x i 2 u ^ s yields 
x i 1 u ^ s ). More importantly, this result tells us that the distribution of haplotype frequencies 
conditional on the number of haplotypes observed is approximately independent of u/s if 
Hence, given that a sweep occurred, all information about the strength of the sweep 
is contained in the number of haplotypes and the precise values of their frequencies do not 
contain any additional information if u <C s. However, whenever there are deviations from 
the assumptions made here, the haplotype frequencies will contain additional information. 



C Stochastic derivation of the haplotype spectrum 

The dynamics of rare haplotypes are strongly influenced by random genetic drift and we 
have to ascertain the deterministic arguments made in the main text by a more careful 
stochastic calculation. While hard in general, an approximate analytic calculation of the 
frequency spectrum of rare haplotypes is feasible in our case for the following reasons: (i) 
The dynamics of a beneficial allele are essentially deterministic since it is much more frequent 
than haplotypes that arise through secondary mutations, (ii) The dynamics of rare haplotypes 
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can be described by a linear branching process since they are always a small fraction of the 
total population. 



As already done in Equation (14), we decompose the distribution of haplotype frequencies 
into the distribution P(t\, . . . , t^\t) of times when the novel haplotypes arise and probability 
P(n,t\to) that a haplotype is present in n copies at time t, given it arose at time to- We 
will derive P(n,t\to) first and consider the spectrum due to the superposition of several 
independent seeding events below. 

C.l Distribution of rare variants arising in a logistic sweep 

To model the stochastic dynamics of rare haplotypes, we use a continuous time branching 
process in which individuals produce identical copies of themselves with rate 1 + g(t) and die 
with rate 1, i.e., the unit of time is chosen to be the generation time. The average number 
of offspring of a given individual in this model is 1 + g(t). Hence, g(t) is the growth rate of 
the haplotype carrying the beneficial allele. In the case of a sweep, we have 

g(t) = s(l-x(t))-u , (19) 

where the first term accounts for selection (x(t) is the frequency of the beneficial allele) and 
the second term accounts for mutations that change the state of the haplotype. The dynamics 
of P(n,t\t ) are described by the forward Master equation 

d t P(n,t\t ) = (l+g(t))(n-l)P(n-l,t\t ) + (n + l)P(n + l,t\t )-(2+g(t))nP(n,t\t ) , (20) 

which accounts for replication (first term) and death (second term). To solve for P(n,t\to), 
it is useful to consider the generating function G(X,t\t ) = A n P(n, £|io), which obeys the 
equation 

d t G(X, t\to) = [-( 2 + 9 (*)) A + (1 + g(t))\ 2 + 1] d\G(\, t\t ) (21) 

with initial condition G(X, tol^o) = A. This equation can be solved via the method of charac- 
teristics, with the result 

G(A, t\t ) « 1 — , (22) 

e -^#) + (l-A)^fe-^" 5(f "' 

where we have used 1 + g(t) ~ 1 along the way. The latter is a good approximation if 
selection is weak in one generation and amounts to neglecting terms of order s 2 . We will 
now substitute the explicit expression for g(t), where it will be convenient to parametrize the 
frequency of the beneficial allele (t) = (1 + e^-^y 1 with r = s _1 log2iVs. Using this 
form of g(t), we find for the generating function 

5(1 - A)(l + e -s{t -r)y-u(t-t ) 

G{X,t\t ) - 1 - _ - 5e _ s(i _ r) + (1 _ A) [ e s(t -r)-u(t-t ) _ e s(t-r) + ^-1^ _ £-«(*-*„))] 

(23) 

where s = s — u. Any haplotype that is abundant enough to be sampled with high probability 
most likely originated in the early phase of the sweep (t r), which allows for the approx- 
imation 1 + e~ s (*°~' r ) ~ e~ s (*°~ r ) (1 + 0(n/(sN))) where n is the sample size. Furthermore, 
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we will typically observe the spectrum at times t ^> t when the sweep is almost complete. 
Hence we can approximate 1 + e~ s ^~ T ^ ~ 1 + xwt ~ 1 where xwt is the frequency of the 
deleterious wild type allele at the time of sampling. Using these simplifications, we obtain 

(1 _ \^ e -s(t -T)-u(t-t ) 

G(X, t\t ) « 1 - - - (1 _ A ^__ le _ s(t() _ r) _ u(t _ to) - u _ l{l _ e _ u( i_ to) ^ ( 24 ) 

This expression is straightforwardly expanded into a geometric series in A whose coefficients 
are P(n,t\t ). For large n, one finds 

e -s(t -T)-u(t-t ) e -s(to— r)-u(t-to) i _ e -u(t-t ) 

P(n,t\t ) « e~ n/h where h = + , (25) 

n 2 s — u u 

with relative corrections being on the order of n _1 . The quantity n is the mean copy number 
of the haplotype conditional on non-extinction, and the two terms contributing to n have a 
straightforward interpretation: The first term is the contribution of selection, which amplifies 
the haplotype before the fixation of the beneficial allele. The second term is the contribution 
of random genetic drift, which evaluates simply to t — to in the limit of u{t — to) ^ 1- The 
latter is the analog of the well known fact that a non-extinct neutral allele in a neutral Moran 
process is on average present in n copies after n generations. The expression for n exhibits 
a crossover from an early regime where selection dominates n to a random drift dominated 
regime at large t. In the limit u(t — 1 ) <C 1, we have 

s(t - t ) « e^-'°) 
s(t - t ) > e s ( r -*°) 

This crossover will inform us below about how long the contribution of random drift can be 
neglected when applying our estimator of the strength of the selective sweep. 




C.2 The haplotype frequency spectrum 

Having calculated the copy number distribution of a haplotype that originated at time t , we 
now have to determine the distribution of seeding times and calculate the resulting spectrum 
of haplotype frequencies. New haplotypes that contain the beneficial allele are produced at 
rate 

7 (t) = Nux(t) = i - ^ (t _ r) , (27) 

where, as before, x(t) is the frequency of the sweeping allele. Note that this differs from the 
rate of establishment of novel variants by a factor s, which will reemerge from the stochastic 
calculation. The deterministic approximation for jit) is valid if it is unlikely that new variants 
are seeded before establishment of the founding variant, which requires s ^> u (see ?). Since 
novel haplotypes are seeded and evolve independently to a good approximation, the number 
of haplotypes present in n copies at time t is Poisson distributed with mean 

Q(n,t)= [ dt -y(t )P(n,t\t ) , (28) 
Jo 
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Due to the exponential nature of P(n,t\to), it is convenient to sum Q(n,t) over n > n c and 
calculate the expected number of haplotypes with copy numbers greater than n c . The sum 
is well approximated by the integral W(n c , t) = f n dnQ(n,t), and we have 



W(n c ,t) = Nuj dn [ dt x(t )P(n,t\t ) 

Jo 

t -s(t -T)-u(t-t ) ft u (t-t ) 

Nu / dt — — r-e- n < /h « Nu / dt e~ nc/fl 

' n(l + e-*(*o-r)) J Q fi 



(29) 



where the last approximation assumes that novel haplotypes are seeded while the beneficial 
allele is still expanding exponentially, i.e., e~ s ^ to ~ T > 3> 1. 



It does not seem possible to evaluate the integral over i m Equation (29) analytically. 
However, the integral is dominated by contributions from a well defined time interval and 
can be evaluated perturbatively. For a very large population where drift is negligible, and 
for s 3> u, this integral simplifies to 

rt 

W(n c ,t)^Nu dtoe s{t °- T) - sncea(to ~ T) , (30) 



Jo 

which is very sharply cutoff for s(to— r) 3> — log sn c . Hence we can send the upper integration 
boundary to infinity without loss of accuracy and evaluate this integral exactly. One finds 
W(n c ,t) ~ Nu/(sn c ). The contribution to this integral come from a narrow peak of width 
s _1 and height Nue~ x jn c . Genetic drift and mutation predominantly change only this height 
and width, leaving the shape of the integral approximately invariant. Hence we can evaluate 
this integral by calculating where the integral peaks and how wide this peak is (Laplace's 
method). 

Including the correction due to drift and finite s/u term corrections, the integrand peaks 
when n c m h, which translates into s(r — £*) ~ log(s(n c — t + £*)) as opposed to s(r — £*) ~ 
log sn c without the corrections. The second derivative of the logarithm of the integrand is 
approximately given approximately by 

where At = t — t* is the age of the haplotypes. Hence the peak dominating the integral 
becomes wider by a factor -\ t - With these corrections to the "height" and the "width" of 
the integrand, we obtain the approximate expression 

Nue~ ut /N^ u/s 



W(n c ,t)n- —r - (32) 

s(n c -At) \n c J 

for the integral. This expression is accurate as long as sn c ^> s(t — r) — log(sn c ) and ut <C 1. 
The age of the haplotype evaluates approximately to At = t — r + s~ l log(sn c ), which is of 
order r. The additional factor (N/n c ) u ^ s accounts for the additional time the older haplotypes 
have been degraded by mutations, while the n c — At accounts for the contribution of drift to 
the frequency of the haplotypes. 
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After the sweep, the frequency of the most common haplotype is xq 
expected number of haplotypes above frequency x c is given by 



W(x c ,t) 



u 



X 



- At/N) 



x, 



l+u/s 



e ut and the 



(33) 



This result tells us that the mean number of haplotypes with frequencies greater than x c 
is approximately linear in u/s and decreases with x c approximately as x~ x . Furthermore, 
the expected number of haplotypes above x c is increasing with time, since rare haplotypes 
increase in frequency due to random drift. 

Given that we observe i c haplotypes at frequencies higher then x c , we can use Equa- 



tion (33) to estimate s/u: 



^ 

u i c 



x 



[X r 



At/N) 



l+i c x c /xo 



(34) 



This equation differs from Equation (9) of the main text by a reduction of x c due to random 
drift, which has been ignored in the simple deterministic derivation given in the main text. 
This reduction allows us to correct for the effects of genetic drift as long as drift is not to 
strong. Obviously, the correction fails as At approaches Nx c . 



Equation (34) informs us about the regimes where the proposed methods to estimate the 



selection coefficient is likely to work. Random genetic drift will degrade the signature of the 
sweep for haplotype frequencies smaller than At/N. Since the time needed for completion 
of the sweep is on the order of (logA^s)/s, we require Nsx c > logiVs. Since x c ~ u/(si c ) is 
itself small, we need Ns ^> i c log Ns for the method to work. The breakdown of the method 
is clearly seen in Figure 3 of the main text once Ns falls below 100. 



D Pruning recombinant haplotypes 

Haplotypes that arise by mutation from the founding haplotype differ at exactly one position 
from the founding haplotype, while haplotypes that result from a recombination event with 
a member of the diverse ancestral population typically differ at several positions. Further- 
more, haplotypes that are mutants of mutants will differ at two positions from the founding 
haplotype. In the main text, we argued that one can restrict the haplotype spectrum to 
those haplotypes that differ only at a single site from the founding haplotype, thereby re- 
moving most recombinant haplotypes and mutants of mutants. Here, we show that frequency 
spectrum of such a restricted set of haplotypes differs slightly from that of all haplotypes. 

As before, the frequency of the beneficial allele will typically follow Equation (1) of the 
main text. The frequency of the founding haplotype, however, will remain below this fre- 
quency due to loss through recombination and mutation. 

xM = 2N7T^> • (35) 

where we have abbreviated the initial growth rate of a haplotype by s = s — u — r. Mutations 
on this haplotype establish with rate a(t) = 2usNxo(t). Haplotypes that establish at time 
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ti then typically follow a frequency trajectory 



e s(t-u) 

Xi (t) = — 7 • (36) 

The most likely seeding time of the ith haplotype is given by 

U = \ log - . (37) 
s u 

Hence we obtain for the ratios of haplotype frequencies at times when the beneficial allele is 
near fixation 



(3t 



xo(t) si 

This differs from Equation (7) of the main text in that the ratio is proportional to i" 1 , rather 
than i~ l+u / s . This difference is due to the fact that here haplotypes grow with the same 
rate as the rate at which they are seeded. In the previous case where all haplotypes are 
considered, haplotypes grow with rate s — u — r, while the seeding rate is proportional to the 
frequency of the beneficial allele which grows with rate s. 




Figure 7: Estimating selection strength in presence of recombination. This figure is the 
analog of Fig 4 of the main text with a ten-fold reduced number of segregating sites in the 
initial samples, i.e., lower ancestral diversity. The estimates are very similar, showing that the 
ancestral diversity has no impact on the accuracy of the estimation as long as recombination 
events almost always give rise to unique haplotypes that differ from previous haplotypes at 
more than 1 site. 
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Figure 8: Comparison of the accuracy of our estimator to the accuracy of estimates obtained 



from the program sweepf inder (NIELSEN et al, 2005). Haplotype samples around a selective 



sweep occurring in the middle of the locus were simulated with the program msms (EwiNG 
and Hermisson, 2010). The population size used was N = 10 5 and samples were taken 



when the adaptive allele had reached frequency 0.99 in the population. Panel A: Mean and 
variance of estimates obtained from the program sweepf inder when applied to samples of 
depth 100 for a locus of size 2s /p as a function of different levels of ancestral diversity. Note 
that the length of the simulated locus should provide ample surrounding neutral sequence for 
sweepf inder, given that the dip in diversity is expected to be only of size 0.1s/ p. Panel B: 
Mean and variance of the estimates from our estimator (Equation 9 in the main manuscript) 
when applied to samples of depth 1000 for a locus of size 0.1s/ (p + p) as a function of different 
mutation rates. Analogously to Figure 3 in the main manuscript we used a cutoff of i c = 5 
for the analysis. Recombination rate was always p = 10~ 8 . Sweepf inder performs well only 
if the ancestral diversity is in the range 0.01 and selection coefficients exceed s = 0.001. 
Our method, in contrast, obtains reliable estimates regardless of the ancestral diversity and 
also for weaker selection coefficients. Note that the two methods were applied to different 
data sets (deep population samples of a short locus for our estimator vs. a longer locus at 
only moderate coverage for sweepf inder). The total amount of sequence provided to either 
method, however, was comparable. 
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